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Abstract 

In this paper, we review and apply several approaches to model selection for analysis of variance models 
which are used in a credibility and insurance context. The reversible jump algorithm is employed for 
model selection, where posterior model probabilities are computed. We then apply this method to insurance 
data from workers' compensation insurance schemes. The reversible jump results are compared with the 
Deviance Information Criterion, and are shown to be consistent. 
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1 Introduction 



In this paper, we address a problem posed by Klugman (1987) We consider an example using the efficient 



proposals reversible jump method. In this example, we consider a complex two way analysis of variance 
model using loss ratio. We introduce alternative models of describing the process and perform model dis- 
crimination using the reversible jump algorithm. 

Throughout our discussion we consider dataiJ which are insurance loss ratios. The motivation for working 
with loss ratios are given by |Hogg and Klugman (1984)| and [Klugman (1987)| The higher levels will reflect 
the group to group variations in the departure from the expected losses. This will be more stable than the 
group to group variations in the absolute level of losses. Also we use normal models since we want to 
compare classical credibility models. By assuming a linear least squares approach, as in classical approach, 
there is a tacit assumption of normality underlying the modelling process. 

Suppose that R^iys are the observed loss ratios, and we seek to minimise the predicted future loss ratios 
Rnew The minimum expected loss is the conditional variance of R^ew given Rohs and this minimum variance 
occurs when the predictor is the regression of 7?„e,v on R,ji,s i.e. the conditional expectation E{R„en,\R,ji,s) ■ 
Using this decision theoretic approach we could specify a collection of candidate models, ^ — {M, } say, 
then construct a decision principle based on some collection of utility functions and select the model which 
minimises the expected loss. In some cases, however, the specification of a utility function is not always 
possible and we must seek alternative approaches. In this paper, we show how an approach based on the 
deviance function can be used for model selection. It is assumed that a collection of plausible models exist, 
and we begin by asking the questions: 
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1. Which model explains the data we have observed? 

2. Which model best predicts future observations? 

3. Which model best describes the underlying process which generated the data? 

We briefly review several perspectives on model selection and the connection between them before presenting 
our models and results. 

2 General Perspective 

We consider joint modelling of the parameter vector 9^ and the model M/^. As noted by [Rubin (1995)| the 
Bayes factor is based on the assumption that one of the models being compared is the true model. However, 
we cannot assume this to be generally true, and we do no make this assumption. |Carlin and Louis (1996)| 
discusses several methods using Markov chain methods for model assessment and selection. We analyse 
credibility models using some of these methods. We consider model selection using posterior model proba- 
bilities based on joint modelling over the model space and parameter space. Prediction is often the ultimate 
goal in credibility theory. We consider model selection using predictive ability and the overall complexity 
of the model. We intend to use a decision theoretic approach to prediction using utility theory. We begin by 
motivating a decision theoretic approach and then show how this approach can be implemented using Markov 
Chain Monte Carlo (MCMC) methods. 

[Bernardo and Smith (1994)| discusses several alternative views of model comparison. They are separated 
into three principal classes. The first is called the ^-closed system; it assumes that one of the models is the 
true model generating the observed data; however, it does not specifying which model is the true model. In 
this case, the marginal likelihood of the data is averaged over the specified model. Thus 

p{R)^ £ p{Mi)p{R\Mi). 



In addition Madigan and Raftery (1994) show that in posterior predictive terms if 7 is a quantity of interest. 



averaging over the candidate models produces better results than relying on any single model. 

K 

?r(7l^) = £p(7|Mi,/?)7r(M,|/?) (1) 

i=i 

where n{Mji\R) is the posterior probability of model M<. given the observed data and 

p{j\Mk,R)^ j p{j%dkMkMek%Mk)ddk. 



(2) 



For a general review of Bayesian modelling averaging, see [Clyde (1999)[ and [Hoeting et al. (1999)[ However, 
when the set of candidate models ^ is not exhaustive, we might not be able to average over all possible 
models. In that context, placing a prior distribution on ^ does not apply, and since we are interested only in 
predicting future unknown values, this might be more appropriate than selecting a single model. 

The second alternative is the so called ^-completed view, which simply seeks to compare a set of 
models which are available at that time. In this case = {M,} simply constitute a range of specified 
models to be compared. From this perspective, assigning the probabilities {f (M,), M; G ^} does not make 
sense and the actual overall model specifies beliefs for R of the form p{R) ~ p{R\Mt). Typically, {M,} will 
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have been proposed largely because they are attractive from the point of view of tractability of analysis or 
communication of results, compared with the actual belief model Mr. 

The third alternative is the ./^-open view. In an ^-open system it is assumed that none of the models 
being considered is the true model which generated the observations. In this case, our goal is to select some 
model or subset of models which best describe the data. For the -completed and ^-open views, assigning 
prior probabilities on the model space ^ is inappropriate since statements like p{Mk) = c do not make sense. 
However, in the ^-open case, there is not separate overall beUef specification. 



3 Decision Theoretic Approach 

|Key et al. (1999) 'argue that any criteria for model comparison should depend on the decision context in which 
the comparison is taking place, as well as the perspective from which the models are viewed. In particular, an 
appropriate utility structure is required, making explicit those aspects of the performance of the model that 
is most important. Using a decision theoretic approach, we can assign utilities to the choice of model M,-, 
m(M, , 7), where 7 is some unknown of interest. The general decision problem is then to choose the optimal 
model, M*, by maximising expected utilities 

u{M*\R) ^ &\ipu{Mi\R), 

Mi 

where 

u{Mi\R) = j u{M„Y)n{Y\R)dY 
with 7i{y\R) representing actual beliefs about 7 after observing R in Equation ([T]i- 



Spiegelhalter et al. (2002) propose their deviance information criterion, DIC, as an alternative to Bayes' 



factors. In Spiegelhalter et al. (2002) the DIC is developed to address how well the posterior might predict 
future data generated by the same mechanism that gave rise to the observed data. Our motivation is that 
likelihood ratio tests cannot be used when there are unobservables, and that they apply only to nested models. 
Also likelihood ratio based tests are inconsistent, since as the sample size tends to infinity, the probability 
that the full model is selected does not approach zero ( IGelfand 1996bl) . 

The likelihood ratio gives too much weight to the higher dimensional model, which motivates the discus- 
sion on penalised likelihoods using penalty functions. A good penalty function should depend on both the 
sample size and the dimension of the parameter vector The decision theoretic approach is general enough to 
include traditional model selection strategies, such as choosing the model with the highest posterior probabil- 
ity. For example in the ^-closed system, where we assume that ^ contains the true model, if we assume a 
utility function of the form 

1 if 7 = Mi 



u{Mi,Y) - 
then from (|2|l 



Q if Y^M„ 

1 if7 = Mi 
ifY^Mi 
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and 



The expected utility is then 



7:{y\R) = 



n{Mi\R) if7 = M,' 







if 7^ Mi. 



u{Mi\R) = J u{Mi,y)n{y\R)dy 
= n{Mi\R). 

Therefore, the optimal decision is to choose the model with the highest posterior probability. For the 
completed case, [Bernardo and Smith (1994)| shows that the cross validation predictive density yields similar 
results. The connection between DIC and the utility approach using cross validation predictive densities, 
has been studied by Vehtari and Lampinen (2002) and |Vehtari (2002)| who use cross validation to estimate 
expected utility directly, and also the effective number of parameters. The main differences are, that cross 
validation can be less numerically stable than the DIC and can also require more computation. However, 
DIC can underestimate the expected deviance. For a list of specific utilities used when choosing models, see 



Key et al. (1999) 



4 Computing Posterior Model Probabilities 
4.1 Reversible Jump Algorithm 

We assume there is a countable collection of candidate models, indexed by M e = {Mi, M2,. . . , M^.}. 
We further assume that for each model M,, there exists an unknown parameter vector 0,- e R"' where n,-, the 
dimension of the parameter vector, can vary with M, . 

Typically, we are interested in finding which models have the greatest posterior probabilities, in addition 
to estimates of their parameters. Thus the unknowns in this modelling scenario will include the model index 
Mi, as well as the parameter vector 0, . We assume that the models and corresponding parameter vectors have 
a joint density n{Mi, 0,). The reversible jump algorithm constructs a reversible Markov chain on the state 
space X Um,g^^"' which has n as its stationary distribution (IGreen 19951 1. In many instances, and in 
particular for Bayesian problems, this joint distribution is 

n{Mi, 0.) = n{Mi, Oi\R) oc L{R\M., 0,) p(M,-, 0,), 

where the prior on (M, , 0, ) is often of the form 

p{Mi,e,)=p{e,\Mi)p{M.) 

with p{Mi) being the density of some counting distribution. 

Suppose we are at model M,, and a move to model Mj is proposed with probability r,y. The corresponding 
move from 0, to Oj is achieved by using a deterministic transformation hij, such that 

{ej,v) = h,j{ei,u), (3) 

where u and v are random variables introduced to ensure dimension matching necessary for reversibility. To 
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ensure dimension matching, we must have 



dim(0y) +dim(v) = dim(0,) +dim(M). 



For discussions about possible choices for the function hij, we refer the reader to |Green (1995)[ and |Brooks et al. (2003)[ 
If we denote the ratio 



n{Mj,ej)q{v) rj, 

n{Mi,ei) q{u) nj 



dhij{Bi,u) 



(4) 



by A(0,-, dj), the acceptance probability for a proposed move from model (M,, 0,) to model (My, dj) is: 

min{l,A(0„0,)} 

where q{u) and q{v) are the respective proposal densities for u and v, and \dhij{Qi,u) /d{9i,u) \ is the Jacobian 
of the transformation induced by hij. It can be shown that the algorithm constructed above is reversible 
(IGreen 19951 1 which, again, follows from the detailed balance equation 



7z{Mi, ei)q{u)nj = TziMj, ej)q{v)rji 



dhij{ei,u) 



d{9i,u) 



Detailed balance is necessary to ensure reversibility and is a sufficient condition for the existence of a unique 
stationary distribution. For the reverse move from model Mj to model M, it is easy to see that the transforma- 
tion used is {9i,u) — hjMd i,v), and the acceptance probability for such a move is 



min < 1, 



n{Mi, dj) q{u) nj 

%{Mj, dj) q{v) rji 



dhij{di,u) 



= min{l,A(0„0,)-i}. 



For inference regarding which model has the greater posterior probability, we can base our analysis on a 
realisation of the Markov chain constructed above. The marginal posterior probability of model M, 



7tiM,\R) 



p{Mt)fiR\M.] 



i:M,e.JlP{Mj)f{R\Mjy 



where 



f{R\Mi) = jL{R\Mi,ei)pidi\Mi)ddi 



is the marginal density of the data after integrating over the unknown parameters 9. In practice, we esti- 
mate n{Mi\R) by counting the number of times the Markov chain visits model M,- in a single long run after 
becoming stationary. 



4.2 Efficient Proposals for TD MCMC 

In practice, the between-model moves can be small resulting in poor mixing of the resulting Markov chain. 
In this section, we discuss recent attempts at improving between-model moves by increasing the acceptance 
probabilities for such moves. Several authors have addressed this problem, including |Troughton and Godsill (1997) 
IGiudici and Roberts (1998)||Godsill (200f)l|Rotondi (2002)| and |Al-Awadhi et al. (2004)1 [Green and Mira(2001)| 
proposes an algorithm so that when between-model moves are first rejected, a second attempt is made. This 
algorithm allows for a different proposal to be generated from a new distribution, that depends on the previ- 
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ously rejected proposal. Methods to improve mixing of reversible jump chains have also been proposed by 
[Green (2002)| and |Brooks et al. (2003)1 these are extended by |Ehlers and Brooks (2002)| 

One strategy proposed by |Brooks et al. (2003)[ and extended to more general cases by |Ehlers and Brooks (2002)] 
is based on making the term Aij {9 i, 9 j) in the acceptance probability for between-model moves given in 
Equation (|4|i, as close as possible to 1 . The motivation is that if we make this term as close as possible to 
1, then the reverse move acceptance governed by l/Aij{9i,9 j) will also be maximised resulting in easier 
between-model moves. In general, if the move from (M,-, 0,) => (M,-, 9 j) involves a change in dimension, the 
best values of the parameters for the densities q{u) and ^(v) in Equation (|4]l, will generally be unknown, even 
if their structural forms are known. 

Using some known point (m, v), which we call the centering point, we can solve A,y(0,, 9j) = 1 to get the 
parameter values for these densities. Setting A,j = 1 at some chosen centering point is called the zeroth-order 
method. Where more degrees of freedom are required, we can expand A,y as a Taylor series about (m, v) and 
solve for the proposal parameters. New parameters are proposed so that the mapping function in Equation (O 
is the identity function, i.e., 

i9j,v)=hij{9„u)^{u,9,) 
and the acceptance ratio term A,j(0,-, 9j) probability in Equation ^ becomes 

niMj, 9j) rj, q{v) 



A,ji9i,9j) = 



7t{Mi, 9i) nj q{u) 
n{Mj,9j) rj, q{9i) 
7Z{Mi,9,) njq{9j)' 



Several authors have proposed simulation methods to construct Markov chains which can explore such 
state spaces. These include the product space formulation given in |Carhn and Chib (1995)[ the reversible 
jump (RJMCMC) algorithm of [Green (1995) ' the jump diffusion method of [Grenander and Miller (1994)[ 



and Phillips and Smith (1996) and the continuous time birth-death method of Stephens (2000) Also for 



particular problems involving the size of the regression vector in regression analysis, there is the stochastic 



search variable selection method of George andMcCuUoch (1993) practice trans-dimensional algorithms 



work by updating model parameters for the current model, then proposing to change models with some 
specified probability. 



4.3 Deviance Information Criterion 

The Die is based on using the residual information in X conditional on 9, defined up to a multiplicative 
constant as —2logL{X\9). If we have some estimate 9 = 9{X) of the true parameter, 9', then the excess 
residual information is 

d{X,9',9) = -2logL{X\9') +2logL{X\9) 

This can be thought of as the reduction in uncertainty due to estimation or the degree of overfitting due to 9 
adapting to the data X. From a Bayesian perspective 9' may be replaced by some random variable G 0. 
Then d{X, 9' ,9) can be estimated by its posterior expectation with respect to n{9\X) denoted 

/7D(X,0,0)=E0|;^fl'(X,0,0) 

= E(-21ogL(X|0))+21ogL(X|0). 
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po is then proposed as the effective number of parameters with respect to a model with focus ©. Thus, if 
we take h{X) as some fully specified standardising term that is a function of the data alone, then po may be 
written as 



PD = D{e)-D{e) 

where 

D{e) = -2logL{X\e)+2logh{X). (5) 

Using Bayes' theorem we have 

which can be viewed as the posterior estimate of the gain in information provided by the data about 0, minus 
the plug-in estimate of the gain in information. Having an estimate for the effective number of parameters, 
Pd, the quantity 

DIC = D{e)+2pD 



= D{d)+pD 

can then be used as a Bayesian measure of fit, which when used in models with negligible prior information 
will be approximately equivalent to the DIC criterion. 

If D{-) in Equation (|5]l is available in closed form, po may easily be computed using samples from an 
MCMC run. This is what we propose to do to measure each models complexity and then rank the models in 
terms of their complexity. Even though we have defined po in terms of the expectation with respect to some 
density, other measures such as the mode or median can be used instead. 



5 Discrimination for ANOVA Type Models 

Quite often, the hierarchical credibility model of [Jewell (1975)| can be formulated as an analysis of variance 
type model. In this paper, we use reversible jump techniques to compute posterior model probabilities and 
compare various analysis of variance models. The reversible jump results are also compared with the results 
obtained by using the DIC. 

Hierarchical models in credibility theory have been considered by |Jewell (1975)1 Taylor (1979) |Zehnwirth (1982)[ 



and |Norberg (1986)| Recent reviews of linear estimation for such models has been presented by |Goovaerts and Hoogstad (1987)| 



and Dannenburg et al. (1996) The results in this paper also have implications for other problems such as the 
claims reserving run-off triangle method, which we have not considered. This formulation has already been 
exploited by |Kremer (1982)| and Ntzoufras and Dellaportas (2002) who use MCMC to estimate claim lags. 



In this paper, we address a problem posed by [Klugman (1987)| and we consider an example using the 
efficient proposals reversible jump method. This example is a complex two-way analysis of variance model 
involving loss ratios . We introduce alternative models for describing the process which generated the data, 
and perform model discrimination using the reversible jump algorithm. 
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This paper contributes to the literature on model discrimination based on reversible jumps for reparame- 
terised Biihlmann-Straub model, a two-way model, and the hierarchical model of [Jewell (1975)| The general 
question is whether there is any advantage gained by using a two-way model rather than a simple random 
effects model in analysing the data. Even though the one-way model is a nested sub-model of the two-way 
model, the resulting parameter estimates can be different under both models since they have different in- 
terpretations. In this example, we see that the the two-way model is vastly superior. In the context of the 
Bayesian paradigm, we are able to derive posterior model probabilities and use these to discriminate between 
competing models. For each algorithm, the between-model moves are augmented with within-model moves 
which can be used to estimated model parameters for each model. 

In Section 15.21 we therefore discuss how the choice of parameterisation affects the convergence of the 
Markov chain algorithm for within-model simulations. The between-model moves are done using the Taylor 
series expansion of the between-model acceptance probabilities near to some point called the centering point. 
In some cases using weak non-identifiable centering does not work well. Another approach, which we 
employ in this example, is the conditional maximisation approach, where the centering point is selected to 
maximise the posterior density. 

5.1 The Basic Two-Way Model 




Figure 1 : Centred parameterisation 




Figure 2: Non-centred parameterisation 



The generic hierarchical model can be described as a connected graph as shown in Figure[T] Let 9 denote 
the collection of parameters, Y represent the observed data, and X can take the role of missing data or other 
possibilities. The algorithm for sampling from the joint distribution of 0, X, given the observed data might 
proceed by alternating 

1 . Update 9 from a Markov chain with stationary distribution 9 \X 

2. Update X from a Markov chain with stationary distribution X\9,Y 

The rate of convergence of the Gibbs sampler is directly related to the choice of parameterisation for such 
problems. On the other hand, we might be able to find an alternative parameterisation, {X, 9) — > {X, 9), of 
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the model in Figure [T] where the new missing data is some function of the previous missing data X and the pa- 
rameters 9, such that X is a priori independent of 9. The type of parameterisation shown in Figure|2]is called 
non-centred parameterisation. The corresponding algorithm for simulating from the posterior distribution of 
{X,9), is then 

1. Update 9 from a Markov chain with stationary distribution 9\X,Y 

2. Update X from a Markov chain with stationary distribution X\9,Y . 

For more general discussions, see |Gelfand and Sahu (1999)| and Papaspiliopoulos et al. (2003) 



The general form of the two-way model considered herein is the non-centred parameterisation: 

yijt=li + ai + Pj + Yij + eij, i= l,...,m;j= l,...,n;f = l,...,s, (6) 

in which there are s replications for factors / and j. The error terms in the observations are assumed to 
be normally distributed and can depend on other known values. Quite often we assume that i = 1. The 
interpretation of this model is that there is some overall level common to all observations, ji, and then there 
are treatment effects that depend on the factors / and j, denoted a,- and j3;, respectively. The jij are the 
interactions between the factors and they are assumed identically equal to zero. 

Bayesian analysis of one-way and two-way models and general mixed linear models are studied by 
IScheffe (1959)1 |Box and Tiao (1973)| and [Smith (1973)] The analysis of [Smith (1973)| is based on the more 
general normal Unear model of [Lindley and Smith (1972)| The error term etp, is assumed to be normally 
distributed with Eip ~ ^ (O, (cT/Siy,)^'), where £■,■,-, is some scale factor associated with observation ytp. 
The effects a,- and j3j are assumed to have prior variances 1/Ta and 1 /Tp, respectively. Similar models have 
been analysed by [Nobile and Green (2000)[ who modelled the factor terms as mixtures of normal distribu- 
tions using reversible jump methods to select the number of components in the mixture. [Ahn et al. (1997)[ 
uses classical methods to compare their models. For the within-model parameter updates, we use the Gibbs 
sampler algorithm. We briefly discuss the choice of parameterisation and how different updating schemes 
can affect the within model convergence properties. 

Before discussing how the choice of parameterisation affects the Gibbs sampler for linear mixed models, 
we note that the centering discussed in this section is related to the parameterisation of the models discussed, 
and not to the choice of centering point discussed in relation to the efficient proposals methods. For example, 
let 

J], = + «/ 
Cij = rii + lij. 

The above stated model could be reparameterised so that 

c,;-^^(77„Tr') 

T7,-^(AI,T2"'). 

This new (/x,77,Q parameterisation is then called the centred parameterisation, since the are centred 
about the rji and the Tj, are also centred about jU. The original {iJ.,a,P) parameterisation in ^ is called the 
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non-centred parameterisation. Partial centerings are also possible, see |Gilks and Roberts (1996)| for further 
discussion. 

5.2 Hierarchical Centering and Gibbs Updating Schemes 

[Gelfand et al. (1995)| consider general parameterisations and a hierarchically centred parameterisation by 
increasing the number of levels in a Bayesian analysis. They show that, if — > with Xa and a fixed, 
then the centred parameterisation will be better. If, however, a with Ta and Tp fixed, then the non- 
centred parameterisation will be better They make no optimality claims for such centerings, and generally 
recommend centering the random effects with the largest posterior variance to improve convergence. Thus, 
in the two-way model, we would centre either the a,s or the jS^s, provided that their variability dominated at 
the data level. In problems where the variance components are unknown this would necessitate a prehminary 
run of the algorithm to determine the variance components. 

[Roberts and Sahu (1997)| show that when the target density is Gaussian a deterministic scheme is most 
optimal for fast convergence of the Gibbs sampling algorithm for a class of structured hierarchical models. 
This updating scheme is also optimal for Gaussian target densities when the components can be arranged 
in blocks and where there is negative partial correlation between the blocks. The model parameters in the 
hierarchically centred parameterisation have different interpretations than those in the non-centred implemen- 
tation, so direct comparison is not possible. We, however, compare both implementations using the methods 
of [Roberts and Sahu (1997)| whose results extend the results of [Gelfand et al. (1995)[ Note that with the 
blocked parameterisation, the a;'s are conditionally independent given jj., /3j and <7. Therefore blocking them 
together does not alter the performance of the Gibbs algorithm. Blocking does not completely overcome the 
problems. 

Block updating of the parameters should result in smaller posterior correlations jAmit and Grenander 19911 
ILiu et al. 19941 l. [Roberts and Sahu (1997)[ and [Whittaker (1990)| show that for the parameterisation given in 
Equation (O, the partial correlation between any component of one block and any component of another 
block, is negative. In this case a random scan Gibbs algorithm or a random permutation Gibbs sampling 
algorithm would be expected to perform better than the deterministic scan algorithm that we use. Where the 
target densities are Gaussian, [Amit and Grenander (1991)] recommend the use random updating strategies. 
However, for unknown variance components, this is not necessarily true. 

When the variance components are unknown, the posterior distribution will cease to be Gaussian. The 
variance components will be included in the model with their respective prior specifications. The Gibbs 
sampler needs to sample from the joint posterior distribution of the jj., a, and j3 and the variance components. 
However, the conditional distribution of jx, a, and j3 given the variance component will still be Gaussian. 
Consequently, the behaviour of the Gibbs sampler should still be guided by the above considerations. 

Another reason for choosing this parameterisation is that it allows for easy implementation in reversible 
jump schemes. It allows us to easily construct algorithms to move between models with no parameters 
in common as we now show, since for the one-way model, the more efficient parameterisation depends 
on the ratio of variances. Thus, we choose this model only because it allows for easier moves in the re- 
versible jump scheme. For general discussions about parameterisation and MCMC implementation in linear 
models, see [Hills and Smith (1992)| [Gilks and Roberts (1996)| [Gelfand etal. (1995)| [Gelfand (1996a)l and 
[Gelfand et al. (200 1)[ The case of generalized linear models is considered by [Gelfand and Sahu (1999)} 

We adopt the non-centred parameterisation for the models analysed in this paper partly because the 
variance components are unknown. Also the non-centred parameterisation seems more readily implemented 
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for reversible jump algorithm, since there are usually fewer model parameters. In addition, for non-centred 
models, the proposal distribution can easily be computed using the efficient proposals methods. 



6 Example : Workers' Compensation Insurance 

In this section we analyse a set of insurance data from a Workers' compensation scheme, using a hierarchical 
random effects model. A typical workers' compensation scheme exists to provide workers who are injured in 
the workplace with a guaranteed source of income, until they recover and re-enter the work-force. 



6.1 The Data and Model Specification 

Our model is fully parametric and can be used to describe data representing workers compensation for 25 
classes of occupations across 10 U.S. States, over a period of 7 years. The losses represent frequency counts 
on workers' compensation insurance on permanent partial disability and the exposures are scaled payroll 
totals that have been inflated to represent constant dollars. We use the first 6 years of data for parameter es- 
timation of the model; the 7th year of data is used to test the accuracy of the predictive distribution obtained. 
We need to estimate the class and occupation parameters, so that we have a basis for estimating future obser- 



vations. The dataset has previously been analysed by Klugman (1992) using numerical approximations. Our 



approach will be hierarchical Bayesian using Markov chain Monte Carlo integration to estimate the model 
parameters. 

The results of Klugman (1992)| are based on matrix analytic arguments and numerical approximations 



of the posterior estimates of the parameters. In particular Klugman (1992) uses the method of Gaussian 
quadrature to approximate the posterior distributions of the model parameters. We present a MCMC based 
analysis based on the loss ratios, defined as loss I exposure. We let 

Lijt = losses for State /, Occupation j for year t 
Eijt ~ exposure for State /, Occupation j for year t 
/=!..., 10, ;■ = !,..., 25, f = l...,7. 

and the corresponding loss-ratios by Rij,, where Rij, — Lij,/Eij,. 

There is one occupation class with Eip — for all / and t ; we removed this value of j from our analysis 
so that data for 24 occupation classes are left. We begin by showing how MCMC can be used to implement 



the original model in Klugman (1992) which is a hierarchically centred model. In our analysis, we repa- 
rameterise the model and employ a non-centred model so that each level can then be compared with the first 
level. Other parameterisations are possible (See, for example, |Venables and Ripley (1999)| . The choice of 
parameterisation does not affect the result, since it is the sum, a, + j3j, that really matters. 

6.2 Short Review of the Klugman Model 

The model described by [Klugman (1992)] which is a special case of [Jewell (1975)[ has first level given by 
Rij,\a,^,a ^ J^{ai + Pj,{aEijt)-^), i = 1,...,10;./ = l,...,24;f = 1,...,6, (7) 
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and prior structure 

.2 



a,|M,Ta--^(5M,T„') /-I..., 10, (8) 



/3,|M,T/3--^(iM,V') i = l,---,24. (9) 

For the hyper-parameters a, Ta, and T|3, we use conjugate and diffuse Gamma (a,/?) priors. For pL, we use a 
diffuse Gaussian prior with mean and variance c^^ . The model is a two-way model where we have made 
the assumption that there is no interaction between classes and occupation. The first level (|7|l reflects what we 
think about the data; that the observations are normally distributed about some mean, which does not change 
with time, but depends only on the class (0 and occupation {]). 

We also assume the variance of any particular observation about its mean is proportional to some measure 
of the exposure. This assumption is popular among insurance practitioners such as |Ledolter et al. (1991)| 
Klugman (1992) and |Ramlau-Hansen (1982)"] The second level comprising Equations dSJ and allows 



for any interaction between the class and occupation parameters a = (ai , . . . , aio)' and j3 = (j8i , . . . , j324)' 
respectively. We assume they are independent, and normally distributed with mean equal to one-half the 
overall mean. There is apparently no special reason for choosing such a prior for the a, or /3j parameters, 
other than their sum should equal the overall mean pL. For our implementation we choose a ^ b = c = 0.001. 

6.3 The Posterior Conditional Distributions 

The joint posterior distribution of the parameters, given the data, up to a constant of proportionality, takes the 
form: 

7r(;U,(7,Ta,T|3,a,j3|E,R)oc 

10 24 

p{fi)p{Ta)p{'^l3)pi(y)]Jp{ai\fi,Ta)YlpiPM,Tj3)YlfiRi}t\ai,pj,a). (10) 

'=1 ;=i i,j.t 

From Equation ( fTOl l. we can determine the following posterior conditional distributions for implementing a 
Gibbs updating scheme: The posterior conditional for a,- is 

7z{ai\iJ.,l5,Ta:<y,'E,R) ^ p{ai\iJ.,Ta)YlfiRijt\ai,Pj,a) 

P 

'Tal^/2 + al^pEij,{Rij,-Pj) 1 



a;|;U,)3,Ta,(7,E,R-^ 



Ta + (T Y.jt Eijt Ta + C7 Y.j, Eij, 

The posterior conditional distribution for j3y is 

7i;(j3;|Ai, a, T/3 , (7,E,R) oc ;:,(j3^.|ju, Tp)J^/(7;,-,|a,-, j3^-, (j) 

it 

(Tpjx/l + al^i,E,jr{Rij, - ai) 1 



j3;|;U,a,T/3,C7,E,R~^ 



V + (7 1;, Eij, T/j + CT Y^i, E, 



'Jt 
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The posterior conditional for is 



TT ( At I a , J3 , T„ , , E , R ) oc p ( ^ ) p] p ( a,. I ^ , Ta ) /7 ( J3 J I ^ , ) 

ij 



;U|a,j3,T„,Tp,E,R-^ 

The posterior conditional for a is 



c + 0.25mTa + 0.25nzp c + Q.lSmXa + 0.25nXp 



7r(cj|a,j3,E,R) oc p((j)]-[/(/;,^,|a,. j3^. a) 



(T|a,j3,E,R-- Gamma « + — ,^7+ i^E/^Yl^/jr - - /3;)^ 



The posterior conditional distribution for Ta is 

?r(Ta|Al,Ta,E,R) oc p(Ta)]~[/?(a/|^, Ta) 



T„|;U,T„,E,R- Gamma ( a + + i ^(a,- - iju)^ 



The posterior conditional distribution for is 

7r(T^ |J3 , M , E, R) oc p(Tp ) /7(j3y , 1^ , ) 



T/3|J3,M,E,R^ Gamma (^a + + 5 ^(i^J' " 5M)^J • 

6.4 Results 

Tables [Hand |2]show the posterior means of the parameters with their corresponding 95% HPD intervals. 
The autocorrelation plots of the parameters presented in Figure[3]and Figure|4] show that mixing is not very 
good since there is significant dependence even at lags greater than 30. The posterior parameter estimates are 



almost identical to those obtained by Klugman (1992) even though mixing does not appear to be good. 



In the next section, we reparameterise the model given in Section [6721 This reparameterisation results in 
improved mixing for a and the results were marginally better for j3. Figures [3] and |4] are the autocorrelation 
plots for the first implementation, while Figures |5]and|6] show the corresponding plots for the reparameterised 
implementation. The new parameterisation used is the corner point constraint, where one of the state or 
occupation effects is fixed at zero. Without loss of generality, we assume that a\ and j3i are both identically 
0. The reparameterised model is presented in Section lTTI 

We also investigate whether the state effects (/) are substantial and also whether the occupation effects 
i j) are non-trivial. For this, we employ the reversible jump algorithm. In the absence of a natural jump 
function, we propose to change all model parameters when moving between models. This is not strictly 
necessary. However, if parameter values change substantially between models, then keeping such values 
fixed will result in high reject rates at the accept/reject stage of the Reversible jump algorithm. We discuss 
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the reversible jump algorithm for this example in Section l872l 



Table 1: Estimates of the model parameters, with 95% HPD Intervals. 



mean 95% HPD Interval 





u.uuyj 


(-U.UUoj, U.U/ /U) 


ai 


0.0492 


(0.0310, 0.0676) 




0.0993 


(0.0687, 0.1305) 


0(4 


0.0310 


(0.0134, 0.0487) 


as 


0.0068 


(-0.0113,0.0246) 


oe 


0.0322 


(0.0131,0.0507) 


a? 


0.0100 


(-0.0090, 0.0290) 


ag 


0.1109 


(0.0926, 0.1291) 




0.0067 


(-0.0131,0.0263) 


0:10 


0.0525 


(0.0312, 0.0738) 



7 Reparameterisation Issues 

We reparameterise the model to allow for more freedom of the first level parameters. In model (|2|, both a,- 
and j3; are restricted to have mean /i/2. There is, however, no direct interpretation of the model parameters, 
given this parameterisation 

In addition, we reparameterise the model so that each state effect and occupation effect are compared 
with the first level. For this, we set both a\ and j3i equal to zero. This is the well known corner point 



constraint ( Venables and Ripley 1999 Section 6.2). The reparameterised model is 



7?,-,v|Ai,a,j3,(7-^(Ai + a, + j3,,(cj£„-,)-'), ai = 0, j3i = 0. (11) 

We use multivariate normal priors for a = (a2, . . . , «„,)' and j3 = (j32, . . . , j3„)'. The derivation of the posterior 
conditional distributions are given in the next section. 

The reparameterisation does not affect the final interpretation of the model. It could, however, affect the 



convergence rate of the Gibbs sampler Problems such as these are considered by Papaspiliopoulos et al. (2003) 
who show that the centered parameterisation is not uniformly superior to the non-centered parameterisation, 
and indeed that a partially centered parameterisation might give the fastest convergence rate of the Gibbs 
sampler We also note that the autocorrelations have decreased because of the parameterisations chosen. 
Generally the autocorrelation for hierarchical models depends on the parameterisation chosen. See for exam- 
ple |GiEsand^ober^^ and |Gelfand(1996a)| 

7.1 Using Non-Centred Block Updates 

Let a = (a2, ■ ■ ■ , o:,,,)', j3 = (j32, ■■■ ,l5„y and, as before, let R = {Rijr} denote the loss ratios. The first level is 
described as 

Rijr^.yr{ix + ai + l5j,{aE,jr)-^) , ai =0, jSi =0. 
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Table 2: Estimates of the model parameters, with 95% HPD Intervals. 



mean 95% HPD Interval 



Pi 


0.0711 


(0.0458, 0.0967) 




0.0792 


(0.0545, 0.1042) 


ft 


0.0206 


(0.0009, 0.0403) 


n 

ft 


-0.0269 


(-0.0667, 0.0131) 


n 

ft 


0.0539 


(0.0351, 0.0729) 


n 

ft 


0.1873 


(0.1684, 0.2060) 


n 

Pi 


0.0924 


(0.0606, 0.1243) 




0.0532 


(0.0246, 0.0822) 


ft 


0.0120 


/ r\ f\f\'—i A f\ /^'^ 1 

(-0.0074, 0.0317) 


o 

Pio 


0.0360 


(0.0160, 0.0561) 




0.0206 


(0.0026, 0.0387) 


J3l2 


0.0308 


(0.0111, 0.0503) 


i3l3 


0.0392 


(0.0194, 0.0592) 


J3l4 


0.0515 


(0.0326, 0.0708) 


fts 


0.0816 


(0.0617, 0.1017) 


Pl6 


0.0306 


(0.0114, 0.0503) 


o 

Pl7 




(-U.UUji, U.U4y4) 


i3l8 


0.0178 


(-0.0162, 0.0516) 


J3l9 


0.0256 


(0.0074, 0.0440) 


J320 


0.0143 


(-0.0058, 0.0346) 


ftl 


0.0307 


(0.0119, 0.0494) 


ft2 


0.0034 


(-0.0170, 0.0237) 


J323 


0.0357 


(0.0137, 0.0578) 


ft4 


-0.0003 


(-0.0178,0.0176) 



Given the first above, we choose the following prior distributions for the model parameters 

H^^{0,z^^) , a-^(0,T„'/) , )3 -^(o,T^ '/) , and CT~ Gamma (fl,/7), 

where T^j = Ta = = 0.001, and a — b — 0.001, so that these prior distributions are vague and flat. The law 
of {a,l5 given the data is 

Ki^,a,p,a\R) oc p{a)p{ii)p{a)p{p)L{R\ii,a,p,o). 

The updating scheme used is the deterministic updating strategy with components ((7, pL, a, j3). Other updat- 
ing schemes are possible such as grouping two or more of the parameters given above ( IRoberts and Sahu 1997I ). 
An initial implementation using a random walk metropolis algorithm, which updates all model parameters at 
once, using a covariance matrix estimated from a trial run, did not perform well. The acceptance rate for that 
algorithm was 0.857% and the autocorrelations were large even at lags greater than 50. 

7.2 Posterior Conditionals 

The posterior conditional for jx is 
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We determine that ^ has a normal distribution with mean 



+ (7 Y.Eijt a Y^Eijt {Rijt - at - Pj) , 

\ ijt J \ ijt ) 



and variance 



The posterior conditional of a is 



K{a\^i,l3,(j)ocp{a)L{R\n,a,l3,(j) 

oc exp I - £ af I exp I - f - M - - ^jf | • 

After simplification, we observe that the posterior conditional of a, given the other model parameters, is 
multivariate normal with mean vector given by 







... 



\ ' ( G^j,E2jt{R2jr-fl-pj) \ 

oZjtEijt{Rij,-H-^j) 



\ 

and variance matrix 







'^a + oY^jtE'ijt ... 








\ ... Ta + al,jtE,n,itJ 

The posterior conditional distribution for J3 is given by 

7r(/3 , a, C7) oc p(j3 )L (/?|;u , a , j3 , c7) 

oc exp I - 1 £ J3, 1 exp I - f - M - a, - ^jf | 

from which we determine that J3 follows a multivariate normal distribution with mean vector 



l^ + fy'LitEat ... 

T^ + C7l,,£,3r ■•• 



-1 



V 



/ o"LitEi2t{Ri2t-li-ai) \ 
oY,i,Eat(.Rat-^i-ai) 
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and variance matrix 







'i2t 















V 











The precision parameter a, has posterior conditional distribution which is a gamma distribution with shape 
and scale parameters, respectively, given by 



7.3 Results and Model Interpretation 

We implemented the Gibbs algorithm for the model described in Equation (fTTT) using the conditional distri- 
butions derived above. The resulting mean of the posterior parameter distributions are given in Tables [3] and 
m Posterior 95% HPD intervals for each parameter are also given. Recalling that the quantity of interest is 
/i + a,- + Pj, the negative values of some of the parameters are irrelevant, as these are off-set by the value of jj.. 
Therefore, the a values are all relative to ai which was fixed at 0. If instead, we fixed ai at some other value, 
the other a values would have compensated for this by changing. In particular, if we fix ai at the minimum 
observed value of 0.0035 (ag) then all the a's would be positive. Likewise, fixing a\ at 0.1015 (ag) would 
result in all other a's being negative. In each case ,/x would also change so that jU + a, + j3/ remains constant. 
A similar discussion holds for the jSj values observed. 

In later sections, we allow for models which try to explain the data using only the assumptions of depen- 
dence on state (index /) or on occupation (index j) only. Our results will show that such models are unlikely 
to give adequate description of the underlying process generating the data. 



The lag-k autocorrelation values are not directly comparable across the models. However, the smaller the 
absolute values of the correlation, the better the chain is mixing. A comparative error plots of the state and 
occupation effects are shown in Figures [T] and [8] 



Table 3: Parameter Estimates for the Non centred Parameterisation. 



Parameter Posterior mean 95% HPD Interval 



a2 0.0394 (0.0316,0.0473) 

Oi 0.0966 (0.0680,0.1245) 

0.0215 (0.0161,0.0268) 

as -0.0027 (-0.0099, 0.0041) 

ag 0.0223 (0.0131,0.0314) 

a? 0.0001 (-0.0095, 0.0099) 

ag 0.1015 (0.0932,0.1093) 

ag -0.0035 (-0.0149, 0.0076) 

aio 0.0430 (0.0284,0.0571) 
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(a) a2. 



(b) as. 



(c) 0!4. 



J„LjJ„Ll 
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(d) as. 



(e) aft. 



(f) a,. 
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(g) ag. 



(h) a,. 



(i) aio. 



Figure 5; Autocorrelation plots for a. The dotted lines are 95% confidence bands. 
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Figure 7: Boxplots on the State effects a. 
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Figure 8: Boxplots on the Occupation effects J3. 
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Table 4: Parameter Estimates for the Non centred parameterisation. 



Parameter Posterior mean 95% HPD Interval 



Pi 


0.0083 


(-0.0186 0.0361) 


n 

ft 


-0.0524 


(-0.0748 -0.0302) 


ft 


-0.1140 


(-0.1583 -0.0702) 


ft 


-0.0186 


(-0.0397 0.0031) 


ft 


0.1155 


( 0.0942 0.1366) 


ft 


0.0256 


(-0.0089 0.0610) 


ft 


-0.0179 


(-0.0503 0.0135) 


ft 


-0.0610 


(-0.0833 -0.0387) 


o 

Pio 


-0.0367 


(-0.0592 -0.0140) 


ftl 


-0.0522 


(-0.0732 -0.0317) 


J3l2 


-0.0419 


(-0.0647 -0.0202) 


J3l3 


-0.0333 


(-0.0559 -0.0110) 


J3l4 


-0.0208 


(-0.0430 0.0005) 


^15 


0.0094 


(-0.0128 0.0323) 


o 

P16 


-0.0421 


(-0.0645 -0.0202) 


o 

Pn 


-U.Ujio 


(-U.Uo/j -U.U2Z / ; 


fts 


-0.0581 


(-0.0956 -0.0205) 


J3l9 


-0.0471 


(-0.0677 -0.0258) 


fto 


-0.0587 


(-0.0813-0.0356) 


ftl 


-0.0420 


(-0.0636 -0.0207) 


ftl 


-0.0699 


(-0.0931 -0.0469) 


fta 


-0.0371 


(-0.0615 -0.0122) 


ft4 


-0.0731 


(-0.0935 -0.0525) 



8 Model Discrimination 

This section introduces other models which are alternative models for explaining the data. We then compare 
these one-way models with the two-way model using the reversible jump method. The implementation of 
the reversible jump algorithm will be based on the centering and scaling proposals of ' Brooks et al. (2003)] 
with some modifications. The reversible jump algorithm was introduced in Section l4TT] as a method of model 
selection and discrimination. Model M\ is the full model considered previously and has first level 

Rij,\pL,a,^,G,E ^ ^ {n + ai + ^j,{GEij,)-^) , ai = 0, jSi^O. 

Model M2 seeks to explain the data based on the State only. It has first level description of the data given by 

R,j,\pL' ,a' ,0' ,E ^ {pl' + a[,{o' E,j,)-^) , a[=0. 

The third model, denoted by M3, has first level 

R,p\pi\fi'\a'\E ^ ^(Ai" + ft", ((T%y)-i) , = 0, 

where the explanatory variables now depend on the occupations, indexed by j. Model M\ is designed to 
test the hypothesis that both occupation and class effects are observed in the loss data, while models M\ 
and M2 are designed to test the hypotheses that one of these factors is missing from the observed data. We 
could plausibly add a fourth model to test the absence of both class and occupation effects. However, for our 
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analysis, we assume that there is at least one of these effects present. 
8.1 Die Results 

We computed the DIC values for each model. The results are tabulated in Table |5j The results are consistent 



Table 5: Z)/C Results. 



Model 






Pd 


DIC 


Ml 


-2687.25 


-2721.19 


33.94 


-2653.31 


Ml 


-1643.03 


-1653.91 


10.88 


-1632.15 


M3 


-2114.05 


-2138.95 


24.90 


-2089.15 



with the number of parameters and the hierarchical levels in each model. They show that by using a hierar- 
chical approach we have essentially lost a fraction of a parameter in each model. Since 1 1 1 = 34, 1 62 1 — 1 1 . 
and 1 03 1 =24 the fraction lost is seen to be very small though. 

8.2 Reversible Jump using Automatic Proposals 

In this section, we use the reversible jump algorithm to explore the possibility that the data are generated 
by some process which depends on the state only, or even the occupation class only. To implement this, 
we now introduce two additional models. Both models are actually sub-models of the more general model 
introduced in Section|2] Proposing /i, a, j3 is desired, since we can take correlations into consideration. In all 
observed cases, the off diagonal elements are negative reflecting the fact that when pL increases, for example, 
a,- decreases. 

For between-model moves, we propose to update all parameters when we change model, so that the 
acceptance rates are of the form given in Section 14.21 This should help with between-model moves as 
proposed parameters will be close to their modal values when we change model. The posterior distributions 
for (/i,a,j3), {n',a'), and (;U",j3") in models Mi, M2 and M3, respectively, are not standard. Using the 
efficient proposal scheme of [Brooks et al. (2003)] we can find a Gaussian density which approximates the 
posterior conditional for these parameters. 

Notice that in all the given acceptance probabilities, the Jacobian term is 1 since we simulate new values 
for all the model parameters . Given that we are now at model M, , we propose a move to model Mj according to 
the probability matrix (r,y). For our implementation, we take r,y = j,i =^ j. The inverse variance parameter 
a differs across each of the three models. We therefore propose to change a when we change models by 
simulating new values form its marginal posterior distribution. 

An initial attempt using weak non-identifiable centering to derive the parameters for the proposal density, 
did not work well. The algorithm when started in a particular model, remains in that model and does not 
explore the entire model space ^ = {Mi,M2,M3}. Further analysis showed that by using non-identifiable 
centering, the proposed parameters are usually in an area of very small posterior probability, and since there 
are lots of data, the posterior density of the proposed model is very small compared with the current model 
and hence a small acceptance probability results. Weak non-identifiable centering does not work well for this 
example, perhaps because of the large number of parameters which are added or removed at each iteration. 

Consequently, we instead propose the conditional maximisation approach. In the conditional maximisa- 
tion approach, the centering point is chosen close to the posterior mean or mode so that the joint distribution 
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is maximised. The remaining parameters are then derived using the li^ order methods described earher. For 
implementation of the conditional maximisation scheme, we 



• Run each model and compute posterior mean/modes 



Use these posterior estimates as the centering point in the F'' order scheme 



• The method does not use weak non-identifiability centering since likelihood of both smaller and larger 
model are not identical at the centering point. 

8.3 Moves Between Models Mi and M2 

For moves between models M\ and M2 we consider the ratio A21 defined as 

7z{ii,a,l5,a) p{Mi) rn qifi' ,a' ,a') 



A21 



7t{^i',a',a') p{M2) r2\ q{ii,a,^,a) 
K{fi,a,l5,a) p{Mi) q{fi',a',a') 



when rtj = j. The posterior marginal distributions for a and <j' are different under both models. Therefore, 
setting a = g' will not work very well for between-model moves. We therefore simulate a from its posterior 
marginal distribution then simulate {jj.,a,l5) from q{ii,a,l5\G). Likewise, for the density q{li' , a', ff'), we 
first simulate o' from its posterior marginal distribution, then simulate {lJ.',a') ~ q{fl' ,a'\G') so that the 
acceptance term becomes 

^ ^ n{n,a,li,a) pjMi) q(ii' ,a'\a')q{a') 
^' n{ii',a',a') p{M2) q{ii,a,^\a)q{a)' 

Taking logs, we have logAii = 

log7r(^,a,j3,a) + \ogq{ii' ,a'\a') - \ogn{^' ,a' ,a') - log^(^,a,j3|ff) + Kn, (12) 

where K12 contains terms not involving (/x, a,j3) or (/i', a'). We now recall the updating scheme proposed 
earlier. We begin by simulating new values for the precision parameters a and a' depending on the proposed 
move. If the move is of type M\ ^ M2, then we simulate a new value for a'\ otherwise, if the move is of 
type M2 Ml, a new value for a is simulated from the posterior marginal density. Having simulated new 
values of the precision parameters, we can then simulate new values for the model parameters depending on 
the type of move. Using this strategy, we can then see that the expression on the right of Equation (fT2]) can 
be decomposed into four distinct components when deriving the proposal densities. Since 

7:{ix,a,j3,a) p{Mi) q{n' ,a'\a')q{a') 

A21 — 



n{ii',a',a') p{M2) q{n,a,l5\a)q{a)' 
the proposal density parameters for moves involving model Mi, ji, a and j3, can be obtained from 



V*^ log ;r , a , )3 , (7) - log ^(^ , a, )3 1 ct) 



= 0, 

(A,«,J3) 
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where V= (<9/<9;U,<9/<9a,<9/<9j3)'. Substituting 



exp <^ - - V + 1^ a,i^ + ^ j3 2 + (J £ Eijt {R,j, -ii-a,~^jf 



•jt 



-+£1-1 



exp{—ba} 



and by also assuming that ^(/x, a, j3 1(7) is a Gaussian density with variance matrix Li and mean vector mi, 
we then derive the following proposal density parameters. 



The mean vector satisfies 



which can be solved to give 



E-'^ - 











a \ -nil ] = 

















Va/ 






^ = (5/5)3), and 




n 

;=2 


ijt 



Using Equation (fTZt . we can also derive the parameters for the proposal density for (/x', a') for moves 
involving model M2. To derive the proposal parameters for jumps involving model M2, we consider only the 
terms involving pi' and a' . Thus, we solve 



VMog 7r(M', a', cj') - log^(Ai', «>') 



(A',a') 



at our chosen centering, (/!', a'), point to get proposal parameters for a' | a'), where V = {d /djj.' ,d /da') 
Substituting 



Kin ,a ,a)oc 



exp • 



'^2 
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and assuming that q{ix', a'\(j') is a Gaussian density with variance matrix E2 and mean vector OT2. 



^2 



and m2 satisfies 



which gives the mean vector 



where = {d/d^'), and Va = {3/ da') and 



8.4 Moves Between Models Mi and M3 

The acceptance probability for a proposed move between models M\ and M3 is given by min{ 1 , A31 } where 



A3l = 



n{pi",^'\a")p{Mi) q{fi,a,l5,a) 



Since the parameter values change between models, we first propose a new value of a and then simulate new 
values for the other parameters given this value of a. Thus, the term A31 can further be written as 

7r(Ai,a,j3,cT) p{M,) q{fi" ,l5"\a") q{a") 



7z{fi",l5",a")piM,) q{^i,a,p\a) q{a) 



If we split the term logA3i into terms involving (/x, a,j3) and {n",j3"), we can see that q{^,a,P\a) will be 
identical to those derived in Section |831 since logA3i ~ 



log;r(M,a,/3,cj) - ?(m,«,/3|cj) - log;r(M", j3", a") + ^(m",/3"|ct") + Ki^ 



We therefore refer to Section [83] for the proposal parameters involving model Mi. For moves from model 
M3 to model Mi , we are increasing the size of the parameter space by 23 and such big changes in the size of 
the parameter vector can have small acceptance rates. For the reverse move, we decrease the parameter space 
from 34 to 11, a decrease of 23. Since the parameter values and interpretations change between models, we 
propose to change all parameter values when we change models. This means that even though the models are 
nested, 'down' moves are not deterministic. 

Assuming now that q{n'\^"\a") is Gaussian with variance matrix Z3 and mean vector m^, then solving 



V^log;r(M",/3",(T")-log^(M",i3'V" 



(A"-/3 
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where {p,",^") is the centering point, yields 

""°(r)"''(v)' 

id/d^"), and 

7=2 (yf 

8.5 Moves Between Models M2 and M3 

The acceptance for moves between models M2 and M3 is given by min{l,A23} where 

^ _ ;r(M^/3^ a")/HM3) ^/(m^^'-c^Q 
K{n',a',a') p{M2)q{n",^",a"y 

We further note that both models have different interpretation, and in addition, there seems to be no natural 
diffeomorphism between the parameters in both models. The precision parameter a is clearly different under 
both models. We use again our strategy of simulating a then simulating the remaining model parameters 
based on this new value of a. To reflect this the term A23 can therefore be written as 

K{^i',a\a') p{M2) q{n",p"\a")q{a"y 

For all the derived covariance matrices and mean vectors, the between-model moves seem to be accepted with 
greater frequency, if we use a point close to the posterior modes as the centering point. Using values dispersed 
with respect to the posterior modes still result in the same stationary distribution. However, the proposal 
parameters decline in quality if we do so. This results in fewer between-model moves being accepted. For 
example when moving between models M2 and M3, centering at will result in the same covariance matrix, 
since the covariance matrix depends only on the data and a. However, in computing the mean vector nik, the 
values may not be close to the posterior modes which may also result in proposed values being in a part of 
the space with very low probability. 

These values will not affect the prior ratio or the proposal ratio. However, these values will result in a 
small value for the likelihood and consequently a small value for the acceptance probability. This seems to be 
dependent on the data, as in this case, there are lots of data available which leads to a dominating UkeUhood. 
Thus small changes in the value of parameters can lead to disproportionately large changes in the likelihood 
function. In fact, no such between-model moves were observed in our simulations when the centering point 
was not near the posterior modes. 



and 



so that 



where = {d/dn"), and Vo 
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8.6 Simulation Study 

To investigate the possibility that model Mi is superior simply because it has more parameters, we simulate 
several datasets from models M2 and M3, and apply the algorithm to see which model has greatest posterior 
probabiUty. To test the accuracy of the reversible jump model selection we simulate several datasets from 
models M2 and M3 . We then apply the algorithm to these datasets to observe the posterior model probabilities. 
If the algorithm is working correctly then the model from which the data are simulated should have the highest 
posterior probabihty. For datasets that are simulated from model M2, both models Mi and M2 were able to 
estimate these parameters; however model M3 could not. For the reversible jump algorithm however model 
M2 had posterior probability equal to 1. 

For data simulated from model M3, both models Mi and M3 are able to estimate accurately the parameter 
values. Model M3 had posterior probability equal to 1 . These results are consistent with what we expect. The 
model Ml was able to fit the data simulated from both models M2 and M3, since both are sub-models of the 
bigger model. However model M2 could not adequately describe the data simulated from model My, likewise 
model M3 could not adequately describe the data simulated from model M2. For data that are simulated from 
model Ml both models M2 and M3 provided rather poor fits. In each case the algorithm chose the model 
from which the data was simulated with probability 1 . This means that having additional parameters does not 
provide a better fit to the data simulated from the smaller models. 

8.7 Sensitivity to Prior Parameters and Centering Point 

The results of our analysis are not sensitive to the prior distributions, which is good. The convergence of our 
algorithm, however, depends on the choice of a suitable centering point. A choice of centering point close to 
the posterior mean of the model parameters results in an algorithm which converges much faster. Our method 
is firstiy to run each model individually and record the posteriors means once the algorithm has converged. 
These values are then used as our centering point. 

Centering at the posterior modes is not strictly necessary for the algorithm to work. However, since 
moves of type (M2 M3) are not between nested models, centering at the posterior modes provides a use- 
ful guide for determining the mean vector and covariance matrix of the proposal densities q{ii' ,a'\G') and 
q{lJ." ,ji"\o") since models M2 and M3 are not nested. Generally, centering at posterior modes allow for 
non-nested model moves. If we considered only moves of type (Mi <s=> M2) or {Mi <^ M3) then the choice of 
a centering point would not matter since both models M2 and M3 are nested sub-models of model Mi . 

8.8 Reversible Jump Results 

The fuU model with both state and occupation parameters is preferred to the restricted models with either 
state only or occupation only parameters. The posterior probability of the fuU model is approximately 1. It 
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Figure 9: Posterior model probabilities. The prior probabilities have been chosen so that the models have 
approximately equal posterior probabihties. 



could be the case that Mi is preferred simply because it has more parameters and so is better at explaining the 
data. However, as we shall explore in Section |876l with simulated data, if the data are from models M2 or M3 
both models would be selected with probability 1 and would be preferred to the more complex model Mi . 

Even though the posterior model probabilities for M2 and M3 are very small, practically 0, we can still get 
the reversible jump algorithm to explore the model space by choosing appropriate prior model probabilities. 
For convenience, i.e. for the algorithm to jump between models we take p{Mi) / p{M2) = e^^^'^ , p{Mi) / p{M^) 
= e^^^^ and hence p{Mj,) / p{M2) = e^'^**. These priors allow the algorithm to explore all three models. The 
resulting transition matrix, where the (ij) term gives the probability of moving between model M,- and Mj, 
is 





Ml 


M2 


M3 


Ml 


1 0.554 


0.233 


0.213 


M2 


0.208 


0.565 


0.227 


M3 


1^0.198 


0.246 


0.556 



which has limiting probabilities (0.308,0.358,0.334). Taking into consideration the prior model probabili- 
ties, the results clearly indicate that the full model Mi is more likely to describe the process which generated 
the data. Models M2 and M3 are less likely to describe such a process. The centering points used are the 
posterior means of the model parameters. For the scale parameters, we simulate those from their marginal 
posterior distributions, and then simulate the other proposal parameters conditional on this value of the scale 
parameter. The results clearly show the k{Mi \R)k,\, 7r(M2|^) « 0, and n{MT,\R) « 0. To assess convergence 
of the algorithm, we simulate 3 chains using different starting values and different random number seeds for 
a total of 100000 iterations. Both the ;|f-square and Kolmogorov-Smirnov diagnostics are computed. These 
diagnostics are plotted in Figure [TT] In both cases, the diagnostic is well above the critical 5% value. The 
methods employed in this paper are not exactly weak non-identifiable centering methods. The approach that 
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Figure 10: Model trace indicator for the reversible jump algorithm. 
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Figure 11: Convergence diagnostics. 
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we use is general enough so that down moves, say from model Mi to models M2 or M3, are not deterministic. 
This approach also allow the use of the posterior mean as the mean of or proposal density, from which the 
posterior variance matrix can be derived using the centering methods described. 



9 Summary 

Anova models arise in many areas of insurance credibility theory. The Biihlmann-Straub model can be repa- 
rameterised as a one-way model. In this paper, we use posterior model probabilities to compare one-way and 
two-way models. The results can be extended to cover yet more general models such as the |Jewell (1975)| and 
Taylor (1974) models. Using the conditional maximisation scheme of [Brooks et al. (2003)[ we constructed 



an algorithm which can be used to compute posterior model probabilities for model discrimination. When 
applied to loss ratios extracted from datasets of worker's compensation insurance in the United States, there 
is overwhelming posterior odds in favour of the full model. This seems quite plausible given the structure 
and size of the data used. Model discrimination measures computed using the deviance information criterion, 
Die, give results which are consistent with the reversible jump model probabilities obtained. 
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